Introduction:

In this project the red wine data will be analysed. The main aim of the project is to understand which of variables in the dataset impact the quality of the wine. This will be understood by performing Exploratory Data Analysis(EDA) on the dataset. We will perform Univariate analysis, Bivariate Analysis and Multivariate analysis on the variables to understand the data and variables.

setwd("/udacity")

getwd()
## [1] "C:/udacity"
redWineData <- read.csv(file="c:/udacity/wineQualityReds.csv", header=TRUE, sep=",")

The data has been loaded into the redWineData, we will be running the str function on the dataset to view the variables present.

str(redWineData)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
dim(redWineData)
## [1] 1599   13

Observation regarding the dataset

  • There are 1599 rows and 13 columns(variables) in the dataframe
  • We need to understand which of the variable has impact on the quality variable.
  • Few points regarding the variables Input variables (based on physicochemical tests): 1 - fixed acidity (tartaric acid - g / dm^3) 2 - volatile acidity (acetic acid - g / dm^3) 3 - citric acid (g / dm^3) 4 - residual sugar (g / dm^3) 5 - chlorides (sodium chloride - g / dm^3 6 - free sulfur dioxide (mg / dm^3) 7 - total sulfur dioxide (mg / dm^3) 8 - density (g / cm^3) 9 - pH 10 - sulphates (potassium sulphate - g / dm3) 11 - alcohol (% by volume) Output variable (based on sensory data): 12 - quality (score between 0 and 10)
  • Some description for the variables: 1 - fixed acidity: most acids involved with wine or fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of water is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

Creating a new variable rating, this will categorise the wines based on the quality and total acidity.

redWineData$rating <- ifelse(redWineData$quality < 5, 'bad', ifelse(
  redWineData$quality < 7, 'average', 'good'))

redWineData$rating <- ordered(redWineData$rating,
                       levels = c('bad', 'average', 'good'))

redWineData$total_acidity <- redWineData$fixed.acidity + redWineData$volatile.acidity

Check if the newly added column are preset in the dataframe

str(redWineData)
## 'data.frame':    1599 obs. of  15 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ rating              : Ord.factor w/ 3 levels "bad"<"average"<..: 2 2 2 2 2 2 2 3 3 2 ...
##  $ total_acidity       : num  8.1 8.68 8.56 11.48 8.1 ...
summary(redWineData)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality          rating     total_acidity   
##  Min.   : 8.40   Min.   :3.000   bad    :  63   Min.   : 5.120  
##  1st Qu.: 9.50   1st Qu.:5.000   average:1319   1st Qu.: 7.680  
##  Median :10.20   Median :6.000   good   : 217   Median : 8.445  
##  Mean   :10.42   Mean   :5.636                  Mean   : 8.847  
##  3rd Qu.:11.10   3rd Qu.:6.000                  3rd Qu.: 9.740  
##  Max.   :14.90   Max.   :8.000                  Max.   :16.285

Univariate Plot Section

The individual variables will be analysed before finding their impact on the quality of wine. This will help us understand the nature of each variable.

Import all the required libraries

library("ggplot2")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("gridExtra")
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
#library(Simpsons)
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Attaching package: 'memisc'
## The following objects are masked from 'package:dplyr':
## 
##     collect, recode, rename
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
library(pander)
## 
## Attaching package: 'pander'
## The following object is masked from 'package:GGally':
## 
##     wrap
library(corrplot)
## corrplot 0.84 loaded
library(MASS)
library(Hmisc)
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:memisc':
## 
##     %nin%, html
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(pastecs)
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(easyGgplot2)

Here is plot for quality:

ggplot(data=redWineData, aes(x=quality)) + geom_bar(width = 1, color = 'Brown', fill ='sky blue')

describe(redWineData$quality)
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 1599 5.64 0.81      6    5.59 1.48   3   8     5 0.22     0.29
##      se
## X1 0.02
stat.desc(redWineData$quality)
##      nbr.val     nbr.null       nbr.na          min          max 
## 1.599000e+03 0.000000e+00 0.000000e+00 3.000000e+00 8.000000e+00 
##        range          sum       median         mean      SE.mean 
## 5.000000e+00 9.012000e+03 6.000000e+00 5.636023e+00 2.019555e-02 
## CI.mean.0.95          var      std.dev     coef.var 
## 3.961255e-02 6.521684e-01 8.075694e-01 1.432871e-01

Output

describe(redWineData$quality) vars n mean sd median trimmed mad min max range skew kurtosis se X1 1 1599 5.64 0.81 6 5.59 1.48 3 8 5 0.22 0.29 0.02

stat.desc(redWineData$quality) nbr.val nbr.null nbr.na min max 1.599000e+03 0.000000e+00 0.000000e+00 3.000000e+00 8.000000e+00 range sum median mean SE.mean 5.000000e+00 9.012000e+03 6.000000e+00 5.636023e+00 2.019555e-02 CI.mean.0.95 var std.dev coef.var 3.961255e-02 6.521684e-01 8.075694e-01 1.432871e-01

Plot is for the other variables

#function to plot graph with x and y limits
plot_no_lim <- function (var,var_name)
{
  title <- paste(var_name," distribution")
  
grid.arrange(ggplot(redWineData, aes( x = 1, y = var ) ) + 
               geom_jitter(alpha = 0.1 ) +
               geom_boxplot(alpha = 0.2, color = 'red' ) +
               scale_y_continuous() + labs(y = var_name) + ggtitle(title),
ggplot(data = redWineData, aes(x = var)) +
  geom_histogram(binwidth = 1, color = 'black',fill = I('orange')) + 
  scale_x_continuous() + labs(x = var_name) + ggtitle(title),ncol = 2)
  
describe(var)
}

#function to plot graphs with x and y limits set to understand the distribution of data
plot_with_lim <- function (var,l1,l2,var_name,b1)
{
   title <- paste(var_name," distribution")
grid.arrange(ggplot(redWineData, aes( x = 1, y = var ) ) + 
               geom_jitter(alpha = 0.1 ) +
               geom_boxplot(alpha = 0.2, color = 'red' ) +
               scale_y_continuous(lim = c(l1,l2)) + labs(y = var_name) +ggtitle(title),
ggplot(data = redWineData, aes(x = var)) +
  geom_histogram(binwidth = b1, color = 'black',fill = I('orange')) + 
  scale_x_continuous(lim = c(l1,l2)) + labs(x = var_name) + ggtitle(title),ncol = 2)
}

#plot the graphs for each of the variable of interest to understand the distrinution
plot_no_lim(redWineData$fixed.acidity,"fixed.acidity")

##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 1599 8.32 1.74    7.9    8.15 1.48 4.6 15.9  11.3 0.98     1.12
##      se
## X1 0.04
#setting the limits for x and y axis to 4 and 14 as there is maximum distribution of data between these values
plot_with_lim(redWineData$fixed.acidity,4,14,"fixed.acidity",1)
## Warning: Removed 8 rows containing non-finite values (stat_boxplot).
## Warning: Removed 9 rows containing missing values (geom_point).
## Warning: Removed 8 rows containing non-finite values (stat_bin).

Result of describe function vars n mean sd median trimmed mad min max range skew kurtosis X1 1 1599 8.32 1.74 7.9 8.15 1.48 4.6 15.9 11.3 0.98 1.12 se X1 0.04

#plotting for volatile acidity
plot_no_lim(redWineData$volatile.acidity,"volatile.acidity")

##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 1599 0.53 0.18   0.52    0.52 0.18 0.12 1.58  1.46 0.67     1.21
##    se
## X1  0
#setting the limits for x and y axis to 0 and 1 as there is maximum distribution of data between these values
plot_with_lim(redWineData$volatile.acidity,0,1,"volatile.acidity",0.25)
## Warning: Removed 21 rows containing non-finite values (stat_boxplot).
## Warning: Removed 23 rows containing missing values (geom_point).
## Warning: Removed 21 rows containing non-finite values (stat_bin).

Output of Describe() vars n mean sd median trimmed mad min max range skew kurtosis se X1 1 1599 0.53 0.18 0.52 0.52 0.18 0.12 1.58 1.46 0.67 1.21 0

#plotting for ph
plot_no_lim(redWineData$pH,"pH")

##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 1599 3.31 0.15   3.31    3.31 0.15 2.74 4.01  1.27 0.19      0.8
##    se
## X1  0
#setting the limits for x and y axis 
plot_with_lim(redWineData$pH,0,6,"pH",0.2)

Output vars n mean sd median trimmed mad min max range skew kurtosis se X1 1 1599 3.31 0.15 3.31 3.31 0.15 2.74 4.01 1.27 0.19 0.8 0

#plotting for citric.acid
plot_no_lim(redWineData$citric.acid,"citric.acid")

##    vars    n mean   sd median trimmed  mad min max range skew kurtosis se
## X1    1 1599 0.27 0.19   0.26    0.26 0.25   0   1     1 0.32    -0.79  0
#setting the limits for x and y axis 
plot_with_lim(redWineData$citric.acid,-1,2,"citric.acid",0.08)

vars n mean sd median trimmed mad min max range skew kurtosis se X1 1 1599 0.27 0.19 0.26 0.26 0.25 0 1 1 0.32 -0.79 0

#plotting for residual.sugar
plot_no_lim(redWineData$residual.sugar,"residual.sugar")

##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 1599 2.54 1.41    2.2    2.26 0.44 0.9 15.5  14.6 4.53    28.49
##      se
## X1 0.04
#setting the limits for x and y axis 
plot_with_lim(redWineData$residual.sugar,1,8,"residual.sugar",0.1)
## Warning: Removed 23 rows containing non-finite values (stat_boxplot).
## Warning: Removed 23 rows containing missing values (geom_point).
## Warning: Removed 23 rows containing non-finite values (stat_bin).

Output: vars n mean sd median trimmed mad min max range skew kurtosis se X1 1 1599 2.54 1.41 2.2 2.26 0.44 0.9 15.5 14.6 4.53 28.49 0.04

#plotting for citric.acid
plot_no_lim(redWineData$chlorides,"chlorides")

##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 1599 0.09 0.05   0.08    0.08 0.01 0.01 0.61   0.6 5.67    41.53
##    se
## X1  0
#setting the limits for x and y axis 
plot_with_lim(redWineData$chlorides,0,0.5,"chlorides",0.02)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing non-finite values (stat_bin).

Output: vars n mean sd median trimmed mad min max range skew kurtosis se X1 1 1599 0.09 0.05 0.08 0.08 0.01 0.01 0.61 0.6 5.67 41.53 0

#plotting for free.sulphur.dioxide
plot_no_lim(redWineData$free.sulfur.dioxide,"free.sulfur.dioxide")

##    vars    n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 1599 15.87 10.46     14   14.58 10.38   1  72    71 1.25     2.01
##      se
## X1 0.26
#setting the limits for x and y axis 
plot_with_lim(redWineData$free.sulfur.dioxide,0,45,"free.sulfur.dioxide",1)
## Warning: Removed 24 rows containing non-finite values (stat_boxplot).
## Warning: Removed 26 rows containing missing values (geom_point).
## Warning: Removed 24 rows containing non-finite values (stat_bin).

Output: vars n mean sd median trimmed mad min max range skew kurtosis se X1 1 1599 15.87 10.46 14 14.58 10.38 1 72 71 1.25 2.01 0.26

#plotting for total.sulphur.dioxide
plot_no_lim(redWineData$total.sulfur.dioxide,"total.sulfur.dioxide")

##    vars    n  mean   sd median trimmed   mad min max range skew kurtosis
## X1    1 1599 46.47 32.9     38   41.84 26.69   6 289   283 1.51     3.79
##      se
## X1 0.82
#setting the limits for x and y axis 
plot_with_lim(redWineData$total.sulfur.dioxide,0,180,"total.sulfur.dioxide",5)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing non-finite values (stat_bin).

Output: vars n mean sd median trimmed mad min max range skew kurtosis se X1 1 1599 46.47 32.9 38 41.84 26.69 6 289 283 1.51 3.79 0.82

#plotting for sulphates
plot_no_lim(redWineData$sulphates,"sulphates")

##    vars    n mean   sd median trimmed  mad  min max range skew kurtosis se
## X1    1 1599 0.66 0.17   0.62    0.64 0.12 0.33   2  1.67 2.42    11.66  0
#setting the limits for x and y axis 
plot_with_lim(redWineData$sulphates,0,2,"sulphates",0.1)

Output: vars n mean sd median trimmed mad min max range skew kurtosis se X1 1 1599 0.66 0.17 0.62 0.64 0.12 0.33 2 1.67 2.42 11.66 0

#plotting for density
plot_no_lim(redWineData$density,"density")

##    vars    n mean sd median trimmed mad  min max range skew kurtosis se
## X1    1 1599    1  0      1       1   0 0.99   1  0.01 0.07     0.92  0
#setting the limits for x and y axis 
plot_with_lim(redWineData$density,0.5,1.5,"density",0.001)

Output: vars n mean sd median trimmed mad min max range skew kurtosis se X1 1 1599 1 0 1 1 0 0.99 1 0.01 0.07 0.92 0

#plotting for alcohol
plot_no_lim(redWineData$alcohol,"alcohol")

##    vars    n  mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 1599 10.42 1.07   10.2   10.31 1.04 8.4 14.9   6.5 0.86     0.19
##      se
## X1 0.03
#setting the limits for x and y axis 
plot_with_lim(redWineData$alcohol,8,14,"alcohol",0.1)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## Warning: Removed 7 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).

Output: vars n mean sd median trimmed mad min max range skew kurtosis se X1 1 1599 10.42 1.07 10.2 10.31 1.04 8.4 14.9 6.5 0.86 0.19 0.03

#plotting for total_acidity
plot_no_lim(redWineData$total_acidity,"total_acidity")

##    vars    n mean  sd median trimmed  mad  min   max range skew kurtosis
## X1    1 1599 8.85 1.7   8.45    8.69 1.42 5.12 16.29 11.17 0.97     1.23
##      se
## X1 0.04
#setting the limits for x and y axis 
plot_with_lim(redWineData$total_acidity,4,14,"total_acidity",0.1)
## Warning: Removed 13 rows containing non-finite values (stat_boxplot).
## Warning: Removed 13 rows containing missing values (geom_point).
## Warning: Removed 13 rows containing non-finite values (stat_bin).

Output: vars n mean sd median trimmed mad min max range skew kurtosis se X1 1 1599 8.85 1.7 8.45 8.69 1.42 5.12 16.29 11.17 0.97 1.23 0.04

Univariate Analysis

Observations in the univariate plots:

  • The variables alcohol,density,pH,fixed acidity,volatile acidity and citric acid are normally distributed as the the skewness is closer to 0.
  • The variables sulphate,total sulphur dioxide and free sulphur dioxide are slightly positively skewed in distribution
  • The variables chlorides and residual sugar are highly positive skewed distribution with outliers present in the extreme.
  • The citric acid variable has a large number of zero values.
  • Quality variable has maximum data in the average category (5 to 7), there are very few observations for good(>7) and bad (0-4) quality of wine.
  • For the maximum number of observations it is seen that the alcohol value is between 9 and 11.

What is the structure of your dataset?

The dataset has 13 variables and 1599 observation. The variables in the dataset are

fixed.acidity
volatile.acidity
citric.acid
residual.sugar
chlorides
free.sulfur.dioxide total.sulfur.dioxide density
pH
sulphates
alcohol
quality

What is/are the main feature(s) of interest in your dataset?

The variable of interest is quality. We want to study the ariables that have impact on quality of wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The expectation is that citric acid,ph,residual sugar, alcohol and total acidity will contribute to the investigate the quality of wine. These factors contribute to the taste of wine determining its quality. So may be the mentioned variables to contribute to its impact on quality.

Did you create any new variables from existing variables in the dataset?

Yes , 2 new variables have been created. total_acidity, this the summation of the volatile acidity and fixed acidity as these 2 variables together determine the acidity of the wine. The second varaible created is rating, this categorises the wines based on their quality score in bad,average and good categories.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  • The variables alcohol,density,pH,fixed acidity,volatile acidity and citric acid are normally distributed as the the skewness is closer to 0.
  • The variables sulphate,total sulphur dioxide and free sulphur dioxide are slightly positively skewed in distribution
  • The variables chlorides and residual sugar are highly positive skewed distribution with outliers present in the extreme.
  • The citric acid variable has a large number of zero values.

The x axis and y axis have been set to limits to have a closer view of the data. Plots with and after removing outliers have been plot to understand the distribution of data.

Bivariate Plot Section

We will first ploat a scatterplot matrix, to understand the relation between 2 variables.

#ggpairs(redWineData, aes(colour = rating, alpha = 0.4))
#set.seed(666)
#ggpairs(redWineData[sample.int(nrow(redWineData),1000),])
#To increase the readability we will plot 5 variables agains quality first and then plot the remaining
ggpairs(redWineData,columns= c("fixed.acidity","volatile.acidity","citric.acid","residual.sugar","chlorides","quality"), columnLabels = c("fixed.acidity","volatile.acidity","citric.acid","residual.sugar","chlorides","quality"),aes(colour = rating, alpha=0.4))

# plot graph for the remaining variables
ggpairs(redWineData,columns= c("free.sulfur.dioxide","total.sulfur.dioxide","pH","sulphates","alcohol","total_acidity","quality"), columnLabels = c("free.sulfur.dioxide","total.sulfur.dioxide","pH","sulphates","alcohol","total_acidity","quality"),aes(colour = rating, alpha=0.4))

## observations from the scatterplot matrix * There are no variables that have strong corelation with quality. * From comparison of the corelation coefficient all variables with quality, the below seem to have some relation with the quality alcohol(0.476), volatile acidity (-0.391), sulphates (0.476) and citric acid(0.226) * There is strong corelation between citric acid and fixed and volatile acidity. * There is strong co relation between total sulfur dioxide and free sulfur dioxode, ph and total acidity

# function to plot different variables against quality

plot_relation_graph <- function(xvar,yvar,a1,xvar_name,yvar_name)
{
  title <- paste(xvar_name, "Content vs", yvar_name, sep = " ")
  
 ggplot(aes(x=xvar, y=yvar), data=redWineData) + 
  geom_jitter(alpha=a1) + 
  geom_smooth(method = "lm", se = FALSE) + labs(x = xvar_name, y= yvar_name) +
  ggtitle(title)
}
#Plot relation between alcohol and quality
plot_relation_graph(redWineData$alcohol,redWineData$quality,0.66,"alcohol","quality")

#Plot relation between volatile acidity and quality
plot_relation_graph(redWineData$volatile.acidity,redWineData$quality,0.5,"volatile.acidity","quality")

#Plot relation between residual sugar and quality
plot_relation_graph(redWineData$residual.sugar,redWineData$quality,0.5,"residual.sugar","quality")

#Plot relation between alcohol and quality
plot_relation_graph(redWineData$pH,redWineData$quality,0.5,"pH","quality")

#Plot relation between sulphates and quality
plot_relation_graph(redWineData$sulphates,redWineData$quality,0.66,"sulphates","quality")

#Plot relation between citric acid and quality
plot_relation_graph(redWineData$citric.acid,redWineData$quality,0.5,"citric acid","quality")

#Plot relation between citric acid and volatile acidity
plot_relation_graph(redWineData$citric.acid,redWineData$volatile.acidity,0.2,"citric.acid","volatile.acidity")

#Plot relation between citric acid and fixed acidity
plot_relation_graph(redWineData$citric.acid,redWineData$fixed.acidity,0.2,"citric acid","fixed acidity")

#Plot relation between total sulphur dioxide and quality
plot_relation_graph(redWineData$total.sulfur.dioxide,redWineData$quality,0.2,"total sulphur dioxide","quality")

#Plot relation between free sulphur dioxide and quality
plot_relation_graph(redWineData$free.sulfur.dioxide,redWineData$quality,0.2,"free sulphur dioxide","quality")

#Plot relation between density and quality
plot_relation_graph(redWineData$density,redWineData$quality,0.2,"density","quality")

Now that there is a relation between the 4 variables and quality we will plot a box plot showing the content of the variables in the rating column

plot_rel_rating <- function(yvar,yvar_name)
{
  title <- paste0(yvar_name," vs Wine quality")
  ggplot(redWineData, aes(x=rating, y=yvar,fill=rating)) +
      geom_boxplot()+
      xlab("wine category") + ylab(yvar_name) +
      ggtitle(title)
}

plt1 <- plot_rel_rating(redWineData$alcohol,"Alcohol")
plt2 <- plot_rel_rating(redWineData$alcohol,"Sulphates")
plt3 <- plot_rel_rating(redWineData$alcohol,"Volatile.acidity")
plt4 <- plot_rel_rating(redWineData$alcohol,"Citric.acid")
grid.arrange(plt1,plt2,plt3,plt4)

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • There are no variables that display strong relationship with quality. Still there is relationship between alcohol,sulphates,volatile acidity, citric acid and quality Also there is very strong relation observed between citric acid and fixed and volatile acidity Better wines seem to have higher concentration of Citric Acid. Better wines seem to have higher alcohol percentages. Residual sugar has no impact on quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It has been observed that there is a strong relation between pH and total acidity. Also there has been a strong relation observed between citric acid and fixed acidity and citric acid and volatile acidity.

What was the strongest relationship you found?

Relative to quality, alcohol had the strongest relation. Relative to all other ‘different’ variables citric acid and fixed acidity have strong relation.

Multivariate Plot section

Now we will plot multiple variable plots to conclude on the factors that impact wine quality.

We have seen that alcohol has a strong relation with quality, hence we will try to plot different variables with alocohol and quality and try to understand if any of them together have impact on the quality of wine.

multi_rel_plot <- function (yvar,xvar)
{
ggplot(data = redWineData,
       aes_string(y = yvar, x = xvar,
           color = as.factor("quality"))) +
  geom_point(alpha = 0.8, size = 1) +
  geom_smooth(method = "lm", se = FALSE,size=1)  + labs(x = xvar,y= yvar) +
  scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"), guide=guide_legend(title="Quality"))
}
multi_rel_plot("density","alcohol")

multi_rel_plot("sulphates","alcohol")

multi_rel_plot("residual.sugar","alcohol")

multi_rel_plot("pH","alcohol")

Now we will plot graphs by fixing the acidity

multi_rel_plot("citric.acid","fixed.acidity")

multi_rel_plot("residual.sugar","fixed.acidity")

multi_rel_plot("density","volatile.acidity")

### Observations: Quality is high when volatile acidity and density are low Quality gets high with more alcohol and less sulphates Wine has good quality when the amount of alcohol is more and volatile acidity is less. Density has the weakest correlations with quality Residual sugar has no impact on quality

Plotting a graph based on the above observation

plot_scatter <- function(xvar_name,yvar_name)
{
  title <- paste(xvar_name," by ",yvar_name)
ggplot2.scatterplot(data=redWineData, xName=xvar_name, yName= yvar_name, size=3,
        groupName="rating", 
        groupColors=c('#999999','#E69F00','#56B4E9'),
        addRegLine=TRUE, fullrange=TRUE, setShapeByGroupName=TRUE,
        backgroundColor="white", 
        xtitle=xvar_name, ytitle=yvar_name,
        mainTitle=title,
        faceting=TRUE, facetingVarNames="rating"
        )
}
#plot for showing effect of volatile acidity and alcohol together
pl1 <- plot_scatter("alcohol","volatile.acidity")
#PLot for alcohol and sulphates together with rating
pl2 <- plot_scatter("alcohol","sulphates")
#PLot for alcohol and chlorides together with rating
pl3 <- plot_scatter("alcohol","chlorides")
#PLot for alcohol and citric acid  together with rating
pl4 <- plot_scatter("alcohol","citric.acid")

grid.arrange(pl1,pl2,pl3,pl4)

#plot a graph to volatile acidity and density on rating
plot_scatter("volatile.acidity","density")

#plot graph with residual.sugar and density to show that they have the weakest relation with quality of wine

p1 <- plot_scatter("alcohol","density")
p2 <- plot_scatter("alcohol","residual.sugar")
grid.arrange(p1,p2)

Linear Model

We will try to plot a linear model based on the data we haveanalysed so far:

plt1 <- lm(quality ~ alcohol, data = redWineData) summary(plt1)

#understanding the linear model
plt1 <- lm(quality ~ alcohol, data = redWineData)
summary(plt1)
## 
## Call:
## lm(formula = quality ~ alcohol, data = redWineData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8442 -0.4112 -0.1690  0.5166  2.5888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.87497    0.17471   10.73   <2e-16 ***
## alcohol      0.36084    0.01668   21.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.2263 
## F-statistic: 468.3 on 1 and 1597 DF,  p-value: < 2.2e-16
m1 <- lm((quality ~ alcohol), data = redWineData)
m2 <- update(m1, ~ . + citric.acid)
m3 <- update(m2, ~ . + chlorides)
m4 <- update(m3, ~ . + residual.sugar)
m5 <- update(m4, ~ . + total_acidity)
m6 <- update(m5, ~ . + sulphates)
mtable(m1, m2, m3, m4, m5,m6)
## 
## Calls:
## m1: lm(formula = (quality ~ alcohol), data = redWineData)
## m2: lm(formula = quality ~ alcohol + citric.acid, data = redWineData)
## m3: lm(formula = quality ~ alcohol + citric.acid + chlorides, data = redWineData)
## m4: lm(formula = quality ~ alcohol + citric.acid + chlorides + residual.sugar, 
##     data = redWineData)
## m5: lm(formula = quality ~ alcohol + citric.acid + chlorides + residual.sugar + 
##     total_acidity, data = redWineData)
## m6: lm(formula = quality ~ alcohol + citric.acid + chlorides + residual.sugar + 
##     total_acidity + sulphates, data = redWineData)
## 
## ======================================================================================================
##                        m1            m2            m3            m4            m5            m6       
## ------------------------------------------------------------------------------------------------------
##   (Intercept)         1.875***      1.830***      2.056***      2.085***      2.000***      1.719***  
##                      (0.175)       (0.171)       (0.186)       (0.187)       (0.233)       (0.228)    
##   alcohol             0.361***      0.346***      0.333***      0.334***      0.336***      0.311***  
##                      (0.017)       (0.016)       (0.017)       (0.017)       (0.017)       (0.017)    
##   citric.acid                       0.730***      0.798***      0.814***      0.767***      0.549***  
##                                    (0.090)       (0.092)       (0.093)       (0.121)       (0.120)    
##   chlorides                                      -1.218**      -1.200**      -1.179**      -2.564***  
##                                                  (0.389)       (0.390)       (0.391)       (0.408)    
##   residual.sugar                                               -0.017        -0.017        -0.010     
##                                                                (0.012)       (0.012)       (0.012)    
##   total_acidity                                                               0.008         0.009     
##                                                                              (0.013)       (0.013)    
##   sulphates                                                                                 1.068***  
##                                                                                            (0.113)    
## ------------------------------------------------------------------------------------------------------
##   R-squared           0.227         0.257         0.262         0.263         0.263         0.302     
##   adj. R-squared      0.226         0.256         0.261         0.261         0.261         0.300     
##   sigma               0.710         0.696         0.694         0.694         0.694         0.676     
##   F                 468.267       276.595       188.675       142.024       113.650       114.880     
##   p                   0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood  -1721.057     -1688.711     -1683.819     -1682.921     -1682.733     -1639.019     
##   Deviance          805.870       773.917       769.196       768.333       768.153       727.280     
##   AIC              3448.114      3385.421      3377.637      3377.842      3379.467      3294.037     
##   BIC              3464.245      3406.930      3404.523      3410.105      3417.106      3337.054     
##   N                1599          1599          1599          1599          1599          1599         
## ======================================================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There were few strong relationships that identified in bivariate analysis when combined together had impact on the quality of wine here are the observation: * Quality is high when volatile acidity and density are low * Quality gets high with more alcohol and less sulphates * Wine has good quality when the amount of alcohol is more and volatile acidity is less. Also there were few variables that when added alongwith alcohol showed no impact on the quality. * Density has the weakest correlations with quality * Residual sugar has no impact on quality

Were there any interesting or surprising interactions between features?

Earlier it was assumed that pH and citric acid will have great amount of impact on deciding the quality of the wine. But it was noted that these variables did not have significant impact on the quality of the wine. There were variables like volatile acidity and sulphates which if present in less amount will produce good quality wines.

Final Plots and Summary

Plot 1

ggplot(data=redWineData, aes(x=quality), title("Quality histogram
                                               ")) + geom_bar(width = 1, color = 'Brown', fill ='sky blue') 

##Plot 1 description

It can be noted that the dataset provided contains average quality wines. There are very few observations for good and bad quality wines. This constraint makes it difficult to determine the factors that will impact the quality of wine.

Plot 2

plot_rel_rating <- function(yvar,yvar_name)
{
  title <- paste0(yvar_name," vs Wine quality")
  ggplot(redWineData, aes(x=rating, y=yvar,fill=rating)) +
      geom_boxplot()+
      xlab("wine category") + ylab(yvar_name) +
      ggtitle(title)
}

plt1 <- plot_rel_rating(redWineData$alcohol,"Alcohol")
plt2 <- plot_rel_rating(redWineData$alcohol,"Sulphates")
plt3 <- plot_rel_rating(redWineData$alcohol,"Volatile.acidity")
plt4 <- plot_rel_rating(redWineData$alcohol,"Citric.acid")
grid.arrange(plt1,plt2,plt3,plt4)

Plot 2 description

The above plot show that alcohol,sulphates ,volatile.acidity and citric acid have strong corelation with the quality.

PLot 3

plot_scatter <- function(xvar_name,yvar_name)
{
  title <- paste(xvar_name," by ",yvar_name)
ggplot2.scatterplot(data=redWineData, xName=xvar_name, yName= yvar_name, size=3,
        groupName="rating", 
        groupColors=c('#999999','#E69F00','#56B4E9'),
        addRegLine=TRUE, fullrange=TRUE, setShapeByGroupName=TRUE,
        backgroundColor="white", 
        xtitle=xvar_name, ytitle=yvar_name,
        mainTitle=title,
        faceting=TRUE, facetingVarNames="rating"
        )
}
#plot for showing effect of volatile acidity and alcohol together
pl1 <- plot_scatter("alcohol","volatile.acidity")
#PLot for alcohol and sulphates together with rating
pl2 <- plot_scatter("alcohol","sulphates")
#PLot for alcohol and chlorides together with rating
pl3 <- plot_scatter("alcohol","chlorides")
#PLot for alcohol and citric acid  together with rating
pl4 <- plot_scatter("alcohol","citric.acid")

grid.arrange(pl1,pl2,pl3,pl4)

#Plot 3 description It is observed that we can get good quality wine when the volatile.acidity and sulphates amount are less and alcohol content is high. There is no impact of density and pH on quality of wine.

Reflection

It was thought that pH and density will contribute a major role on the quality of wine, before beginning the bivariate and multivariate analysis. It was only alcohol that played the part to the quality before and after the analysis.

After the analysis it was found that high amount of alcohol and less amount of sulphates and volatile acidity can produce good quality wines.

For future work if the dataset with good and bad rating wines is procured, the variables impacting the quality can be better determined.